32 research outputs found

    Tall-and-skinny QR factorization with approximate Householder reflectors on graphics processors

    Full text link
    [EN] We present a novel method for the QR factorization of large tall-and-skinny matrices that introduces an approximation technique for computing the Householder vectors. This approach is very competitive on a hybrid platform equipped with a graphics processor, with a performance advantage over the conventional factorization due to the reduced amount of data transfers between the graphics accelerator and the main memory of the host. Our experiments show that, for tall¿skinny matrices, the new approach outperforms the code in MAGMA by a large margin, while it is very competitive for square matrices when the memory transfers and CPU computations are the bottleneck of the Householder QR factorizationThis research was supported by the Project TIN2017-82972-R from the MINECO (Spain) and the EU H2020 Project 732631 "OPRECOMP. Open Transprecision Computing".Tomás Domínguez, AE.; Quintana-Ortí, ES. (2020). Tall-and-skinny QR factorization with approximate Householder reflectors on graphics processors. The Journal of Supercomputing (Online). 76(11):8771-8786. https://doi.org/10.1007/s11227-020-03176-3S877187867611Abdelfattah A, Haidar A, Tomov S, Dongarra J (2018) Analysis and design techniques towards high-performance and energy-efficient dense linear solvers on GPUs. IEEE Trans Parallel Distrib Syst 29(12):2700–2712. https://doi.org/10.1109/TPDS.2018.2842785Ballard G, Demmel J, Grigori L, Jacquelin M, Knight N, Nguyen H (2015) Reconstructing Householder vectors from tall-skinny QR. J Parallel Distrib Comput 85:3–31. https://doi.org/10.1016/j.jpdc.2015.06.003Barrachina S, Castillo M, Igual FD, Mayo R, Quintana-Ortí ES (2008) Solving dense linear systems on graphics processors. In: Luque E, Margalef T, Benítez D (eds) Euro-Par 2008—parallel processing. Springer, Heidelberg, pp 739–748Benson AR, Gleich DF, Demmel J (2013) Direct QR factorizations for tall-and-skinny matrices in MapReduce architectures. In: 2013 IEEE International Conference on Big Data, pp 264–272. https://doi.org/10.1109/BigData.2013.6691583Businger P, Golub GH (1965) Linear least squares solutions by householder transformations. Numer Math 7(3):269–276. https://doi.org/10.1007/BF01436084Demmel J, Grigori L, Hoemmen M, Langou J (2012) Communication-optimal parallel and sequential QR and LU factorizations. SIAM J Sci Comput 34(1):206–239. https://doi.org/10.1137/080731992Dongarra J, Du Croz J, Hammarling S, Duff IS (1990) A set of level 3 basic linear algebra subprograms. ACM Trans Math Softw 16(1):1–17. https://doi.org/10.1145/77626.79170Drmač Z, Bujanović Z (2008) On the failure of rank-revealing qr factorization software—a case study. ACM Trans Math Softw 35(2):12:1–12:28. https://doi.org/10.1145/1377612.1377616Fukaya T, Nakatsukasa Y, Yanagisawa Y, Yamamoto Y (2014) CholeskyQR2: A simple and communication-avoiding algorithm for computing a tall-skinny QR factorization on a large-scale parallel system. In: 2014 5th workshop on latest advances in scalable algorithms for large-scale systems, pp 31–38. https://doi.org/10.1109/ScalA.2014.11Fukaya T, Kannan R, Nakatsukasa Y, Yamamoto Y, Yanagisawa Y (2018) Shifted CholeskyQR for computing the QR factorization of ill-conditioned matrices, arXiv:1809.11085Golub G, Van Loan C (2013) Matrix computations. Johns Hopkins studies in the mathematical sciences. Johns Hopkins University Press, BaltimoreGunter BC, van de Geijn RA (2005) Parallel out-of-core computation and updating the QR factorization. ACM Trans Math Softw 31(1):60–78. https://doi.org/10.1145/1055531.1055534Joffrain T, Low TM, Quintana-Ortí ES, Rvd Geijn, Zee FGV (2006) Accumulating householder transformations, revisited. ACM Trans Math Softw 32(2):169–179. https://doi.org/10.1145/1141885.1141886Puglisi C (1992) Modification of the householder method based on the compact WY representation. SIAM J Sci Stat Comput 13(3):723–726. https://doi.org/10.1137/0913042Saad Y (2003) Iterative methods for sparse linear systems, 3rd edn. Society for Industrial and Applied Mathematics, PhiladelphiaSchreiber R, Van Loan C (1989) A storage-efficient WY representation for products of householder transformations. SIAM J Sci Comput 10(1):53–57. https://doi.org/10.1137/0910005Stathopoulos A, Wu K (2001) A block orthogonalization procedure with constant synchronization requirements. SIAM J Sci Comput 23(6):2165–2182. https://doi.org/10.1137/S1064827500370883Strazdins P (1998) A comparison of lookahead and algorithmic blocking techniques for parallel matrix factorization. Tech. Rep. TR-CS-98-07, Department of Computer Science, The Australian National University, Canberra 0200 ACT, AustraliaTomás Dominguez AE, Quintana Orti ES (2018) Fast blocking of householder reflectors on graphics processors. In: 2018 26th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp 385–393. https://doi.org/10.1109/PDP2018.2018.00068Volkov V, Demmel JW (2008) LU, QR and Cholesky factorizations using vector capabilities of GPUs. Tech. Rep. 202, LAPACK Working Note. http://www.netlib.org/lapack/lawnspdf/lawn202.pdfYamamoto Y, Nakatsukasa Y, Yanagisawa Y, Fukaya T (2015) Roundoff error analysis of the Cholesky QR2 algorithm. Electron Trans Numer Anal 44:306–326Yamazaki I, Tomov S, Dongarra J (2015) Mixed-precision Cholesky QR factorization and its case studies on multicore CPU with multiple GPUs. SIAM J Sci Comput 37(3):C307–C330. https://doi.org/10.1137/14M097377

    Sparse matrix-vector and matrix-multivector products for the truncated SVD on graphics processors

    Get PDF
    Many practical algorithms for numerical rank computations implement an iterative procedure that involves repeated multiplications of a vector, or a collection of vectors, with both a sparse matrix A and its transpose. Unfortunately, the realization of these sparse products on current high performance libraries often deliver much lower arithmetic throughput when the matrix involved in the product is transposed. In this work, we propose a hybrid sparse matrix layout, named CSRC, that combines the flexibility of some well-known sparse formats to offer a number of appealing properties: (1) CSRC can be obtained at low cost from the popular CSR (compressed sparse row) format; (2) CSRC has similar storage requirements as CSR; and especially, (3) the implementation of the sparse product kernels delivers high performance for both the direct product and its transposed variant on modern graphics accelerators thanks to a significant reduction of atomic operations compared to a conventional implementation based on CSR. This solution thus renders considerably higher performance when integrated into an iterative algorithm for the truncated singular value decomposition (SVD), such as the randomized SVD or, as demonstrated in the experimental results, the block Golub–Kahan–Lanczos algorithm

    Compression and load balancing for efficient sparse matrix-vector product on multicore processors and graphics processing units

    Full text link
    [EN] We contribute to the optimization of the sparse matrix-vector product by introducing a variant of the coordinate sparse matrix format that balances the workload distribution and compresses both the indexing arrays and the numerical information. Our approach is multi-platform, in the sense that the realizations for (general-purpose) multicore processors as well as graphics accelerators (GPUs) are built upon common principles, but differ in the implementation details, which are adapted to avoid thread divergence in the GPU case or maximize compression element-wise (i.e., for each matrix entry) for multicore architectures. Our evaluation on the two last generations of NVIDIA GPUs as well as Intel and AMD processors demonstrate the benefits of the new kernels when compared with the optimized implementations of the sparse matrix-vector product in NVIDIA's cuSPARSE and Intel's MKL, respectively.J. I. Aliaga, E. S. Quintana-Ortí, and A. E. Tomás were supported by TIN2017-82972-R of the Spanish MINECO. H. Anzt and T. Grützmacher were supported by the Impuls und Vernetzungsfond of the Helmholtz Association under grant VH-NG-1241 and by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. The authors would like to thank the Steinbuch Centre for Computing (SCC) of the Karlsruhe Institute of Technology for providing access to an NVIDIA A100 GPU.Aliaga, JI.; Anzt, H.; Grützmacher, T.; Quintana-Ortí, ES.; Tomás Domínguez, AE. (2022). Compression and load balancing for efficient sparse matrix-vector product on multicore processors and graphics processing units. Concurrency and Computation: Practice and Experience. 34(14):1-13. https://doi.org/10.1002/cpe.6515113341

    Look-ahead in the two-sided reduction to compact band forms for symmetric eigenvalue problems and the SVD

    Get PDF
    We address the reduction to compact band forms, via unitary similarity transformations, for the solution of symmetric eigenvalue problems and the computation of the singular value decomposition (SVD). Concretely, in the first case, we revisit the reduction to symmetric band form, while, for the second case, we propose a similar alternative, which transforms the original matrix to (unsymmetric) band form, replacing the conventional reduction method that produces a triangular– band output. In both cases, we describe algorithmic variants of the standard Level 3 Basic Linear Algebra Subroutines (BLAS)-based procedures, enhanced with lookahead, to overcome the performance bottleneck imposed by the panel factorization. Furthermore, our solutions employ an algorithmic block size that differs from the target bandwidth, illustrating the important performance benefits of this decision. Finally, we show that our alternative compact band form for the SVD is key to introduce an effective look-ahead strategy into the corresponding reduction procedure

    Balanced and Compressed Coordinate Layout for the Sparse Matrix-Vector Product on GPUs

    Get PDF
    We contribute to the optimization of the sparse matrix-vector product on graphics processing units by introducing a variant of the coordinate sparse matrix layout that compresses the integer representation of the matrix indices. In addition, we employ a look-ahead table to avoid the storage of repeated numerical values in the sparse matrix, yielding a more compact data representation that is easier to maintain in the cache. Our evaluation on the two most recent generations of NVIDIA GPUs, the V100 and the A100 architectures, shows considerable performance improvements over the kernels for the sparse matrix-vector product in cuSPARSE (CUDA 11.0.167).This work was partially sponsored by the EU H2020 project 732631 OPRECOMP and project TIN2017-82972-R of the Spanish MINECO. Hartwig Anzt and Yuhsiang M. Tsai were supported by the “Impuls und Vernetzungsfond” of the Helmholtz Association under grant VH-NG-1241 and by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. The authors would like to thank the Steinbuch Centre for Computing (SCC) of the Karlsruhe Institute of Technology for providing access to an NVIDIA A100 GPU

    FloatX: A C++ Library for Customized Floating-Point Arithmetic

    Full text link
    "© ACM, 2019. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Transactions on Mathematical Software, {45, 4, (2019)} https://dl.acm.org/doi/10.1145/3368086"[EN] We present FloatX (Float eXtended), a C++ framework to investigate the effect of leveraging customized floating-point formats in numerical applications. FloatX formats are based on binary IEEE 754 with smaller significand and exponent bit counts specified by the user. Among other properties, FloatX facilitates an incremental transformation of the code, relies on hardware-supported floating-point types as back-end to preserve efficiency, and incurs no storage overhead. The article discusses in detail the design principles, programming interface, and datatype casting rules behind FloatX. Furthermore, it demonstrates FloatX's usage and benefits via several case studies from well-known numerical dense linear algebra libraries, such as BLAS and LAPACK; the Ginkgo library for sparse linear systems; and two neural network applications related with image processing and text recognition.This work was supported by the CICYT projects TIN2014-53495-R and TIN2017-82972-R of the MINECO and FEDER, and the EU H2020 project 732631 "OPRECOMP. Open Transprecision Computing."Flegar, G.; Scheidegger, F.; Novakovic, V.; Mariani, G.; Tomás Domínguez, AE.; Malossi, C.; Quintana-Ortí, ES. (2019). FloatX: A C++ Library for Customized Floating-Point Arithmetic. ACM Transactions on Mathematical Software. 45(4):1-23. https://doi.org/10.1145/3368086S123454Edward Anderson Zhaojun Bai L. Susan Blackford James Demmesl Jack J. Dongarra Jeremy Du Croz Sven Hammarling Anne Greenbaum Alan McKenney and Danny C. Sorensen. 1999. LAPACK Users’ Guide (3rd ed.). SIAM. Edward Anderson Zhaojun Bai L. Susan Blackford James Demmesl Jack J. Dongarra Jeremy Du Croz Sven Hammarling Anne Greenbaum Alan McKenney and Danny C. Sorensen. 1999. LAPACK Users’ Guide (3rd ed.). SIAM.Bekas, C., Curioni, A., & Fedulova, I. (2011). Low-cost data uncertainty quantification. Concurrency and Computation: Practice and Experience, 24(8), 908-920. doi:10.1002/cpe.1770Boldo, S., & Melquiond, G. (2008). Emulation of a FMA and Correctly Rounded Sums: Proved Algorithms Using Rounding to Odd. IEEE Transactions on Computers, 57(4), 462-471. doi:10.1109/tc.2007.70819Buttari, A., Dongarra, J., Langou, J., Langou, J., Luszczek, P., & Kurzak, J. (2007). Mixed Precision Iterative Refinement Techniques for the Solution of Dense Linear Systems. The International Journal of High Performance Computing Applications, 21(4), 457-466. doi:10.1177/1094342007084026Dongarra, J. J., Du Croz, J., Hammarling, S., & Duff, I. S. (1990). A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software, 16(1), 1-17. doi:10.1145/77626.79170Figueroa, S. A. (1995). When is double rounding innocuous? ACM SIGNUM Newsletter, 30(3), 21-26. doi:10.1145/221332.221334Fousse, L., Hanrot, G., Lefèvre, V., Pélissier, P., & Zimmermann, P. (2007). MPFR. ACM Transactions on Mathematical Software, 33(2), 13. doi:10.1145/1236463.1236468Mark Gates Piotr Luszczek Ahmad Abdelfattah Jakub Kurzak Jack Dongarra Konstantin Arturov Cris Cecka and Chip Freitag. 2017. C++ API for BLAS and LAPACK. Technical Report 2 ICL-UT-17-03. Mark Gates Piotr Luszczek Ahmad Abdelfattah Jakub Kurzak Jack Dongarra Konstantin Arturov Cris Cecka and Chip Freitag. 2017. C++ API for BLAS and LAPACK. Technical Report 2 ICL-UT-17-03.John Hauser. Accessed March 2019. Berkeley SoftFloat project home page. Retrieved from http://www.jhauser.us/arithmetic/SoftFloat.html. John Hauser. Accessed March 2019. Berkeley SoftFloat project home page. Retrieved from http://www.jhauser.us/arithmetic/SoftFloat.html.Nicholas J. Higham. 2002. Accuracy and Stability of Numerical Algorithms (2nd ed.). Society for Industrial and Applied Mathematics Philadelphia PA. Nicholas J. Higham. 2002. Accuracy and Stability of Numerical Algorithms (2nd ed.). Society for Industrial and Applied Mathematics Philadelphia PA.Parker Hill Babak Zamirai Shengshuo Lu Yu-Wei Chao Michael Laurenzano Mehrzad Samadi Marios Papaefthymiou Scott Mahlke Thomas Wenisch Jia Deng Lingjia Tang and Jason Mars. 2018. Rethinking numerical representations for deep neural networks. arXiv e-prints (Aug 2018). arXiv:1808.02513. Retrieved from https://openreview.net/forum?id&equals;BJ_MGwqlg8noteId&equals;BJ_MGwqlg. Parker Hill Babak Zamirai Shengshuo Lu Yu-Wei Chao Michael Laurenzano Mehrzad Samadi Marios Papaefthymiou Scott Mahlke Thomas Wenisch Jia Deng Lingjia Tang and Jason Mars. 2018. Rethinking numerical representations for deep neural networks. arXiv e-prints (Aug 2018). arXiv:1808.02513. Retrieved from https://openreview.net/forum?id&equals;BJ_MGwqlg8noteId&equals;BJ_MGwqlg.Parker Hill Babak Zamirai Shengshuo Lu Yu-Wei Chao Michael Laurenzano Mehrzad Samadi Marios Papaefthymiou Scott Mahlke Thomas Wenisch Jia Deng etal 2018. Rethinking numerical representations for deep neural networks. 2018. Parker Hill Babak Zamirai Shengshuo Lu Yu-Wei Chao Michael Laurenzano Mehrzad Samadi Marios Papaefthymiou Scott Mahlke Thomas Wenisch Jia Deng et al. 2018. Rethinking numerical representations for deep neural networks. 2018.IBM. 2015. Engineering and Scientific Subroutine Library. Retrieved from http://www-03.ibm.com/systems/power/software/essl/. IBM. 2015. Engineering and Scientific Subroutine Library. Retrieved from http://www-03.ibm.com/systems/power/software/essl/.IEEE. 2008. IEEE Standard for Floating-point Arithmetic. IEEE Std 754-2008 (Aug. 2008) 1--70. DOI:https://doi.org/10.1109/IEEESTD.2008.4610935 IEEE. 2008. IEEE Standard for Floating-point Arithmetic. IEEE Std 754-2008 (Aug. 2008) 1--70. DOI:https://doi.org/10.1109/IEEESTD.2008.4610935Intel. 2015. Math Kernel Library. Retrieved from https://software.intel.com/en-us/intel-mkl. Intel. 2015. Math Kernel Library. Retrieved from https://software.intel.com/en-us/intel-mkl.ISO. 2017. ISO International Standard ISO/IEC 14882:2017(E)—Programming Language C++. Retrieved from https://isocpp.org/std/the-standard. Visited June 2018. ISO. 2017. ISO International Standard ISO/IEC 14882:2017(E)—Programming Language C++. Retrieved from https://isocpp.org/std/the-standard. Visited June 2018.Lefevre, V. (2013). SIPE: Small Integer Plus Exponent. 2013 IEEE 21st Symposium on Computer Arithmetic. doi:10.1109/arith.2013.22Liu, Z., Luo, P., Wang, X., & Tang, X. (2015). Deep Learning Face Attributes in the Wild. 2015 IEEE International Conference on Computer Vision (ICCV). doi:10.1109/iccv.2015.425Érik Martin-Dorel Guillaume Melquiond and Jean-Michel Muller. 2013. Some issues related to double rounding. BIT Num. Math. 53 4 (01 Dec. 2013) 897--924. DOI:https://doi.org/10.1007/s10543-013-0436-2 Érik Martin-Dorel Guillaume Melquiond and Jean-Michel Muller. 2013. Some issues related to double rounding. BIT Num. Math. 53 4 (01 Dec. 2013) 897--924. DOI:https://doi.org/10.1007/s10543-013-0436-2Sparsh Mittal. 2016. A survey of techniques for approximate computing. ACM Comput. Surv. 48 4 Article 62 (Mar. 2016) 33 pages. DOI:https://doi.org/10.1145/2893356 Sparsh Mittal. 2016. A survey of techniques for approximate computing. ACM Comput. Surv. 48 4 Article 62 (Mar. 2016) 33 pages. DOI:https://doi.org/10.1145/2893356NVIDIA. 2016. cuBLAS. Retrieved from https://developer.nvidia.com/cublas. NVIDIA. 2016. cuBLAS. Retrieved from https://developer.nvidia.com/cublas.D. O’Leary. 2006. Matrix factorization for information retrieval. Lecture notes for a course on Advanced Numerical Analysis. University of Maryland. Retrieved from https://www.cs.umd.edu/users/oleary/a600/yahoo.pdf. D. O’Leary. 2006. Matrix factorization for information retrieval. Lecture notes for a course on Advanced Numerical Analysis. University of Maryland. Retrieved from https://www.cs.umd.edu/users/oleary/a600/yahoo.pdf.OpenBLAS. 2015. Retrieved from http://www.openblas.net. OpenBLAS. 2015. Retrieved from http://www.openblas.net.Palmer, T. (2015). Modelling: Build imprecise supercomputers. Nature, 526(7571), 32-33. doi:10.1038/526032aAlec Radford Luke Metz and Soumith Chintala. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. Retrieved from Arxiv Preprint Arxiv:1511.06434 (2015). Alec Radford Luke Metz and Soumith Chintala. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. Retrieved from Arxiv Preprint Arxiv:1511.06434 (2015).Rubio-González, C., Nguyen, C., Nguyen, H. D., Demmel, J., Kahan, W., Sen, K., … Hough, D. (2013). Precimonious. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC ’13. doi:10.1145/2503210.2503296Rump, S. M. (2017). IEEE754 Precision- k base-β Arithmetic Inherited by Precision- m Base-β Arithmetic for k < m. ACM Transactions on Mathematical Software, 43(3), 1-15. doi:10.1145/2785965Rybalkin, V., Wehn, N., Yousefi, M. R., & Stricker, D. (2017). Hardware architecture of Bidirectional Long Short-Term Memory Neural Network for Optical Character Recognition. Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017. doi:10.23919/date.2017.7927210Giuseppe Tagliavini Stefan Mach Davide Rossi Andrea Marongiu and Luca Benini. 2017. A transprecision floating-point platform for ultra-low power computing. Retrieved from Arxiv Preprint Arxiv:1711.10374 (2017). Giuseppe Tagliavini Stefan Mach Davide Rossi Andrea Marongiu and Luca Benini. 2017. A transprecision floating-point platform for ultra-low power computing. Retrieved from Arxiv Preprint Arxiv:1711.10374 (2017).Tobias Thornes. (2016). Can reducing precision improve accuracy in weather and climate models? Weather 71 6 (02 June 2016) 147--150. DOI:https://doi.org/10.1002/wea.2732 Tobias Thornes. (2016). Can reducing precision improve accuracy in weather and climate models? Weather 71 6 (02 June 2016) 147--150. DOI:https://doi.org/10.1002/wea.2732Van Zee, F. G., & van de Geijn, R. A. (2015). BLIS: A Framework for Rapidly Instantiating BLAS Functionality. ACM Transactions on Mathematical Software, 41(3), 1-33. doi:10.1145/2764454Todd L. Veldhuizen. 2003. C++ Templates are Turing Complete. Technical Report. Todd L. Veldhuizen. 2003. C++ Templates are Turing Complete. Technical Report.Qiang Xu Todd Mytkowicz and Nam Sung Kim. 2015. Approximate computing: A survey. IEEE Des. Test 33 (01 2015) 8--22. Qiang Xu Todd Mytkowicz and Nam Sung Kim. 2015. Approximate computing: A survey. IEEE Des. Test 33 (01 2015) 8--22.Ziv, A. (1991). Fast evaluation of elementary mathematical functions with correctly rounded last bit. ACM Transactions on Mathematical Software, 17(3), 410-423. doi:10.1145/114697.11681

    Empirical Study and Modeling of Vehicular Communications at Intersections in the 5 GHz Band

    Get PDF
    [EN] Event warnings are critical in the context of ITS, being dependent on reliable and low-delay delivery ofmessages to nearby vehicles. One of the main challenges to address in this context is intersection management. Since buildings will severely hinder signals in the 5GHz band, it becomes necessary to transmit at the exact moment a vehicle is at the center of an intersection to maximize delivery chances. However, GPS inaccuracy, among other problems, complicates the achievement of this goal. In this paper we study this problem by first analyzing different intersection types, studying the vehicular communications performance in each type of intersection through real scenario experiments. Obtained results show that intersection-related communications depend on the distances to the intersection and line-of-sight (LOS) conditions. Also, depending on the physical characteristics of intersections, the presented blockages introduce different degrees of hampering to message delivery. Based on the modeling of the different intersection types, we then study the expected success ratio when notifying events at intersections. In general, we find that effective propagation of messages at intersections is possible, even in urban canyons and despite GPS errors, as long as rooftop antennas are used to compensate for poor communication conditions.This work was partially supported by the “Ministerio de Economía y Competividad, Programa Estatal de Investigación, Desarollo e Innovación Orientada a los Retos de la Sociedad, Proyectos I+D+I 2014,” Spain, under Grants TEC2014-52690-R and BES-2015-075988.Hadiwardoyo, SA.; Tomás Domínguez, AE.; Hernández-Orallo, E.; Tavares De Araujo Cesariny Calafate, CM.; Cano, J.; Manzoni, P. (2017). Empirical Study and Modeling of Vehicular Communications at Intersections in the 5 GHz Band. Mobile Information Systems. (2861827):1-15. https://doi.org/10.1155/2017/2861827S1152861827Xiong, Z., Sheng, H., Rong, W., & Cooper, D. E. (2012). Intelligent transportation systems for smart cities: a progress review. Science China Information Sciences, 55(12), 2908-2914. doi:10.1007/s11432-012-4725-1Papadimitratos, P., La Fortelle, A., Evenssen, K., Brignolo, R., & Cosenza, S. (2009). Vehicular communication systems: Enabling technologies, applications, and future outlook on intelligent transportation. IEEE Communications Magazine, 47(11), 84-95. doi:10.1109/mcom.2009.5307471Grant-Muller, S., & Usher, M. (2014). Intelligent Transport Systems: The propensity for environmental and economic benefits. Technological Forecasting and Social Change, 82, 149-166. doi:10.1016/j.techfore.2013.06.010Ma, X., Chen, X., & Refai, H. H. (2009). Performance and Reliability of DSRC Vehicular Safety Communication: A Formal Analysis. EURASIP Journal on Wireless Communications and Networking, 2009(1). doi:10.1155/2009/969164Martinez, F. J., Toh, C.-K., Cano, J.-C., Calafate, C. T., & Manzoni, P. (2010). A Street Broadcast Reduction Scheme (SBR) to Mitigate the Broadcast Storm Problem in VANETs. Wireless Personal Communications, 56(3), 559-572. doi:10.1007/s11277-010-9989-4Sanguesa, J. A., Fogue, M., Garrido, P., Martinez, F. J., Cano, J.-C., & Calafate, C. T. (2016). A Survey and Comparative Study of Broadcast Warning Message Dissemination Schemes for VANETs. Mobile Information Systems, 2016, 1-18. doi:10.1155/2016/8714142Sommer, C., Joerer, S., Segata, M., Tonguz, O. K., Cigno, R. L., & Dressler, F. (2015). How Shadowing Hurts Vehicular Communications and How Dynamic Beaconing Can Help. IEEE Transactions on Mobile Computing, 14(7), 1411-1421. doi:10.1109/tmc.2014.2362752Lin, J.-C., Lin, C.-S., Liang, C.-N., & Chen, B.-C. (2012). Wireless communication performance based on IEEE 802.11p R2V field trials. IEEE Communications Magazine, 50(5), 184-191. doi:10.1109/mcom.2012.6194401Gozalvez, J., Sepulcre, M., & Bauza, R. (2012). IEEE 802.11p vehicle to infrastructure communications in urban environments. IEEE Communications Magazine, 50(5), 176-183. doi:10.1109/mcom.2012.6194400Tornell, S. M., Patra, S., Calafate, C. T., Cano, J.-C., & Manzoni, P. (2015). GRCBox: Extending Smartphone Connectivity in Vehicular Networks. International Journal of Distributed Sensor Networks, 11(3), 478064. doi:10.1155/2015/478064Chou, L.-D., Yang, J.-Y., Hsieh, Y.-C., Chang, D.-C., & Tung, C.-F. (2011). Intersection-Based Routing Protocol for VANETs. Wireless Personal Communications, 60(1), 105-124. doi:10.1007/s11277-011-0257-zSaleet, H., Langar, R., Naik, K., Boutaba, R., Nayak, A., & Goel, N. (2011). Intersection-Based Geographical Routing Protocol for VANETs: A Proposal and Analysis. IEEE Transactions on Vehicular Technology, 60(9), 4560-4574. doi:10.1109/tvt.2011.2173510Guan, X., Huang, Y., Cai, Z., & Ohtsuki, T. (2015). Intersection-based forwarding protocol for vehicular ad hoc networks. Telecommunication Systems, 62(1), 67-76. doi:10.1007/s11235-015-9983-yKarney, C. F. F. (2011). Transverse Mercator with an accuracy of a few nanometers. Journal of Geodesy, 85(8), 475-485. doi:10.1007/s00190-011-0445-3Durgin, G., Rappaport, T. S., & Hao Xu. (1998). Measurements and models for radio path loss and penetration loss in and around homes and trees at 5.85 GHz. IEEE Transactions on Communications, 46(11), 1484-1496. doi:10.1109/26.729393Haklay, M., & Weber, P. (2008). OpenStreetMap: User-Generated Street Maps. IEEE Pervasive Computing, 7(4), 12-18. doi:10.1109/mprv.2008.8

    Improving Message Delivery Performance in Opportunistic Networks using a Forced-stop diffusion scheme

    Full text link
    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-40509-4_11The performance of mobile opportunistic networks strongly depends on contact duration. If the contact lasts less than the required transmission times, some messages will not get delivered, and the whole diffusion scheme will be seriously affected. In this paper we propose a new diffusion method, called Forced-Stop, that is based on controlling node mobility to guarantee a complete message transfer. Using the ONE simulator and realistic mobility traces, we compared our proposal with the classical Epidemic diffusion. We show that Forced-Stop improves the message delivery performance, increasing the delivery ratio up to 30\%, and reducing the latency of message delivery up to 40\%, with a limited impact on buffer utilisation and message relaying. These results can be a relevant indication to the designers of opportunistic network applications that could integrate in their products strategies to inform the user about the need to temporarily stop in order to favor the overall data delivery.This work was partially supported by the Ministerio de Economía y Competitividad, Programa Estatal de Investigación, Desarrollo e Innovación Orientada a los Retos de la Sociedad, Proyectos I+D+I 2014, Spain, under Grant TEC2014-52690-R, the Generalitat Valenciana, Spain, under Grant AICO/2015/108, the Secretaría Nacional de Educación Superior, Ciencia, Tecnología e Innovación del Ecuador(SENESCYT), and the Universidad Laica Eloy Alfaro de Manabi, Ecuador.Herrera Tapia, J.; Hernández Orallo, E.; Tomás Domínguez, AE.; Manzoni, P.; Tavares De Araujo Cesariny Calafate, CM.; Cano Escribá, JC. (2016). Improving Message Delivery Performance in Opportunistic Networks using a Forced-stop diffusion scheme. En Ad-hoc, Mobile, and Wireless Networks. Springer. 156-168. https://doi.org/10.1007/978-3-319-40509-4_11S156168Pelusi, L., Passarella, A., Conti, M.: Opportunistic networking: data forwarding in disconnected mobile ad hoc networks. IEEE Commun. Mag. 44(11), 134–141 (2006)Ferretti, S.: Shaping opportunistic networks. Comput. Commun. 36, 481–503 (2013)Keränen, A., Ott, J., Kärkkäinen, T.: The ONE simulator for DTN protocol evaluation. In: Proceedings of the Second International ICST Conference on Simulation Tools and Techniques, Rome (2009)Tsai, T.-C., Chan, H.-H.: NCCU Trace: social-network-aware mobility trace. IEEE Commun. Mag. 53, 144–149 (2015)AnAverage WhatsApp User Sends Messages per Month, 15 September 2015. http://www.statista.com/chart/1938/monthly-whatsapp-usage-per-userNiu, J., Guo, J., Cai, Q., Sadeh, N., Guo, S.: predict and spread: an efficient routing algorithm for opportunistic networking. In: Wireless Communications and Networking Conference (WCNC), 2011 IEEE, pp. 498–503, Cancún (2011)Thakur, G.S., Kumar, U., Helmy, A., Hsu, W.-J.: On the efficacy of mobility modeling for DTN evaluation: analysis of encounter statistics andspatio-temporal preferences. In: 7th International Wireless Communications and Mobile Computing Conference (IWCMC), pp. 510–515, Istanbul (2011)Warthman, F.: Delay-and Disruption-Tolerant Networks (DTNs) a tutorial, version 2.0. In: The InterPlaNetary (IPN) Internet Project. InterPlanetary Networking Special Interest Group (IPNSIG) (2012)Battestini, A., Setlur, V., Sohn, T.: A large scale study of text messaging use. In: 12th International Conference on Human Computer Interaction with Mobile Devices and Services MobileHCI, pp. 1–10, Lisbon (2010)Förster, A., Garg, K., Nguyen, H.A., Giordano, S. On context awareness and social distance in human mobility traces. In: Third ACM International Workshop on Mobile Opportunistic Networks, pp. 5–12, Zürich (2012)Boldrini, C., Conti, M., Passarella, A.: Modelling data dissemination in opportunistic networks. In: Proceedings of the Third ACM Workshop on Challenged Networks - CHANTS 2008, pp. 89–96, San Francisco (2008)Natalizio, E., Loscrí, V.: Controlled mobility in mobile sensor networks: advantages, issues and challenges. Telecommun. Syst. 52(4), 2411–2418 (2013)Neena, V.V., Rajam, V.M.A.: Performance analysis of epidemic routing protocol for opportunistic networks in different mobility patterns. In: 2013 International Conference on Computer Communication and Informatics, pp. 1–5, Coimbatore (2013)Mehta, N., Shah, M.: Performance evaluation of efficient routing protocols in delay tolerant network under different human mobility models. Int. J. Grid Distrib. Comput. 8(1), 169–178 (2015)Su, J., Chin, A., Popivanova, A., Goel, A., Lara, E.D.: User mobility for opportunistic ad-hoc networking. In: Sixth IEEE Workshop on Mobile Computing Systems and Applications (WMCSA 2004), Low Wood (2004)Feng, Z., Chin, K.-W.: A unified study of epidemic routing protocols and their enhancements. In: IEEE 26th International Parallel and Distributed Processing Symposium Workshops PhD Forum (IPDPSW), pp. 1484–1493, Shanghai (2012)Vardalis, D., Tsaoussidis, V.: Exploiting the potential of DTN for energy-efficient internetworking. J. Syst. Softw. 90, 91–103 (2014)Rango, F.D., Amelio, S., Fazio, P.: Epidemic strategies in delay tolerant networks from an energetic point of view: main issues and performance evaluation. J. Networks 10(01), 4–14 (2015)Herrera-Tapia, J., Manzoni, P., Hernández-Orallo, E., Calafate, C.T., Cano, J.-C.: Power consumption evaluation in vehicular opportunistic networks. In: IEEE 12th CCNC 2015 Workshops - VENITS, pp. 925–930, Las Vegas (2015)Erramilli, V., Crovella, M.: Forwarding in opportunistic networks with resource constraints. In: Proceedings of the third ACM workshop on Challenged networks - CHANTS 2008, pp. 41–47, San Francisco (2008)Fathima, G., Wahidabanu, R.: Buffer management for preferential delivery in opportunistic delay tolerant networks. Int. J. Wirel. Mob. Netw. (IJWMN) 3, 15–28 (2011)Pan, D., Ruan, Z., Zhou, N., Liu, X., Song, Z.: A comprehensive-integrated buffer management strategy for opportunistic networks. EURASIP J. Wirel. Commun. Netw. 2013(1), 1–10 (2013)Hernández-Orallo, E., Herrera-Tapia, J., Cano, J.-C., Calafate, C.T., Manzoni, P.: Evaluating the impact of data transfer time in contact-based messaging applications. IEEE Commun. Lett. 19, 1814–1817 (2015)de Abreu, C.S., Salles, R.M.: Modeling message diffusion in epidemical DTN. Ad Hoc Netw. 16, 197–209 (2014

    Reformulating the direct convolution for high-performance deep learning inference on ARM processors

    Get PDF
    We present two high-performance implementations of the convolution operator via the direct algorithm that outperform the so-called lowering approach based on the im2col transform plus the gemm kernel on an ARMv8-based processor. One of our methods presents the additional advantage of zero-memory overhead while the other employs an additional yet rather moderate workspace, substantially smaller than that required by the im2col+gemm solution. In contrast with a previous implementation of a similar zero-memory overhead direct convolution, this work exhibits the key advantage of preserving the conventional NHWC data layout for the input/output activations of the convolution layers.Funding for open access charge: CRUE-Universitat Jaume

    Fast inexact mapping using advanced tree exploration on backward search methods

    Full text link
    Background: Short sequence mapping methods for Next Generation Sequencing consist on a combination of seeding techniques followed by local alignment based on dynamic programming approaches. Most seeding algorithms are based on backward search alignment, using the Burrows Wheeler Transform, the Ferragina and Manzini Index or Suffix Arrays. All these backward search algorithms have excellent performance, but their computational cost highly increases when allowing errors. In this paper, we discuss an inexact mapping algorithm based on pruning strategies for search tree exploration over genomic data. Results: The proposed algorithm achieves a 13x speed-up over similar algorithms when allowing 6 base errors, including insertions, deletions and mismatches. This algorithm can deal with 400 bps reads with up to 9 errors in a high quality Illumina dataset. In this example, the algorithm works as a preprocessor that reduces by 55% the number of reads to be aligned. Depending on the aligner the overall execution time is reduced between 20–40%. Conclusions: Although not intended as a complete sequence mapping tool, the proposed algorithm could be used as a preprocessing step to modern sequence mappers. This step significantly reduces the number reads to be aligned, accelerating overall alignment time. Furthermore, this algorithm could be used for accelerating the seeding step of already available sequence mappers. In addition, an out-of-core index has been implemented for working with large genomes on systems without expensive memory configurations.The authors would like to thank the Universitat Politecnica de Valencia (Spain) in the frame of the grant "High-performance tools for the alignment of genetic sequences using graphic accelerators (GPGPUs)/Herramientas de altas prestaciones para el alineamiento de secuencias geneticas mediante el uso de aceleradores graficos (GPGPUs)", research program PAID-06-11, code 2025.Salavert Torres, J.; Tomás Domínguez, AE.; Tárraga Giménez, J.; Medina Castelló, I.; Dopazo Blazquez, J.; Blanquer Espert, I. (2015). Fast inexact mapping using advanced tree exploration on backward search methods. BMC Bioinformatics. 16(18):1-11. https://doi.org/10.1186/s12859-014-0438-3S1111618Li H, Homer N. A survey of sequence alignment algorithms for next-generation sequencing. Brief Biol. 2010; 11(5):473–83.Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981; 147:195–7.Gotoh O. An improved algorithm for matching biological sequences. J Mol Biol. 1982; 162:705–8.Durbin R, Eddy SR, Krogh A, Mitchison G. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press: Cambridge; 1998. [ http://books.google.es/books?id=R5P2GlJvigQC ]Ferragina P, Manzini G. Indexing compressed text. J ACM. 2005; 52(4):552–81. doi:10.1145/10820361082039Burrows M, Wheeler DJ. A block-sorting lossless data compression algorithm. Technical Report 124. (SRC Digital, DEC Palo Alto); May 1994Manzini G. An analysis of the burrows-wheeler transform. In: Proceedings of the Tenth Annual ACM-SIAM Symposium on Discrete Algorithms. NY: ACM-SIAM: 1999. p. 669–77.Ferragina P, Manzini G. Opportunistic data structures with applications. In: FOCS: 2000. p. 390–398.Li H, Durbin R. Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics. 2009; 25(14):1754–1760.Li R, Yu C, Li Y, Lam T-W, Yiu S-M, Kristiansen K, et al. Soap2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009; 25(15):1966–1967. doi:10.1093/bioinformatics/btp336.Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009; 10:(R25).Luo R, Wong T, Zhu J, Liu C-M, Zhu X, Wu E, et al. Soap3-dp: Fast, accurate and sensitive gpu-based short read aligner. PLoS ONE. 2013; 8(5):65632. doi:10.1371/journal.pone.0065632Liu Y, Schmidt B. Long read alignment based on maximal exact match seeds. Bioinformatics. 2012; 28(18):318–324. doi:10.1093/bioinformatics/bts414Klus P, Lam S, Lyberg D, Cheung M, Pullan G, McFarlane I, et al. Barracuda - a fast short read sequence aligner using graphics processing units. BMC Res Notes. 2012; 5(1):27. doi:10.1186/1756-0500-5-27Salavert J, Blanquer I, Andrés T, Vicente H, Ignacio M, Joaquín T, et al. Using gpus for the exact alignment of short-read genetic sequences by means of the burrows-wheeler transform. IEEE/ACM Trans Comput Biol Bioinf. 2012; 9(4):1245–56. doi:10.1109/TCBB.2012.49Xin Y, Liu B, Min B, Li WXY, Cheung RCC, Fong AS, et al. Parallel architecture for {DNA} sequence inexact matching with burrows-wheeler transform. Microelectron J. 2013; 44(8):670–82. doi:10.1016/j.mejo.2013.05.004Manber U, Myers G. Suffix arrays: A new method for on-line string searches. In: Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms. SODA ’90Philadelphia, PA, USA: Society for Industrial and Applied Mathematics: 1990. p. 319–327. http://dl.acm.org/citation.cfm?id=320176.320218Abouelhoda MI, Kurtz S, Ohlebusch E. The enhanced suffix array and its applications to genome analysis. In: Proc. Workshop on Algorithms in Bioinformatics, in Lecture Notes in Computer Science,Heidelberger, Berlin: Springer: 2002. p. 449–63.Vyverman M, De Baets B, Fack V, Dawyndt P. essamem: finding maximal exact matches using enhanced sparse suffix arrays. Bioinformatics. 2013; 29(6):802–4. doi:10.1093/bioinformatics/btt042Oguzhan Kulekci M, Hon W-K, Shah R, Scott Vitter J, Xu B. Psi-ra: a parallel sparse index for genomic read alignment. BMC Genomics. 2011; 12(Suppl 2):7. doi:10.1186/1471-2164-12-S2-S7Sadakane K. New text indexing functionalities of the compressed suffix arrays. J Algorithms. 2003; 48(2):294–313. doi:10.1016/S0196-6774(03)00087-7Liu C-M, Wong T, Wu E, Luo R, Yiu S-M, Li Y, et al. Soap3: ultra-fast gpu-based parallel alignment tool for short reads. Bioinformatics. 2012; 28(6):878–9. doi:10.1093/bioinformatics/bts061. http://bioinformatics.oxfordjournals.org/content/28/6/878.full.pdf+htmlLam TW, Li R, Tam A, Wong S, Wu E, Yiu SM. High throughput short read alignment via bi-directional bwt. In: IEEE International Conference On Bioinformatics and Biomedicine, 2009. BIBM ’09.,Washington, D.C., USA: IEEE Computer Society Press: 2009. p. 31–6. doi:10.1109/BIBM.2009.42Li H, Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010; 26(5):589–95. doi:10.1093/bioinformatics/btp698Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Meth. 2012; 9(4):357–9. doi:10.1038/nmeth.1923Mu JC, Jiang H, Kiani A, Mohiyuddin M, Asadi NB, Wong WH. Fast and accurate read alignment for resequencing. Bioinformatics. 2012; 28(18):2366–73. doi:10.1093/bioinformatics/bts450Ning Z, Cox AJ, Mullikin JC. Ssaha: A fast search method for large dna databases. Genome Res. 2001; 11(10):1725–9. doi:10.1101/gr.194201Marco-Sola S, Sammeth M, Guigo R, Ribeca P. The GEM mapper: fast, accurate and versatile alignment by filtration. Nat Meth. 2012; 9(12):1185–8. doi:10.1038/nmeth.2221Sadakane K. A library for compressed full-text indexes. https://code.google.com/p/csalib/ (2010)Mäkinen V, Navarro G, Sadakane K. Advantages of backward searching; efficient secondary memory and distributed implementation of compressed suffix arrays. In: Proceedings of the 15th International Conference on Algorithms and Computation. ISAAC’04,Berlin, Heidelberg: Springer: 2004. p. 681–92. doi:10.1007/978-3-540-30551-4_59. http://dx.doi.org/10.1007/978-3-540-30551-4_59Puglisi SJ, Smyth WF, Turpin AH. A taxonomy of suffix array construction algorithms. ACM Comput Surv. 2007; 39(2). doi:10.1145/1242471.1242472Okanohara D, Sadakane K. A linear-time burrows-wheeler transform using induced sorting. In: Karlgren J, Tarhio J, Hyyrö H, editors. String Processing and Information Retrieval. Lecture Notes in Computer Science, vol. 5721. Heidelberg, Berlin: Springer: 2009. p. 90–101.Grossi R, Vitter J. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SICOMP: SIAM J Comput. 2005; 35(2):378–407
    corecore